This notebook shows how to use UDFs (user-defined functions) in Google BigQuery. UDFs allow you to operate on columns of a table, performing arbitrary transformations, and returning the result of those transformations as a value. Datalab currently supports temporary UDFS, which live only within the query that uses them.
You can read more about UDFs here
In this notebook we are going to look at anonymized logs that originated in Google AppEngine. These logs include the paths of requested URIs, which contain a number of query parameters of interest. To help us use the logs, we will create a UDF that extracts the query parameters values, and puts them into a new column of type ARRAY
.
In [1]:
%bq tables describe --name cloud-datalab-samples.appenginelogs.sample_logs_20151027
Out[1]:
In [2]:
%bq sample --count 5 --table cloud-datalab-samples.appenginelogs.sample_logs_20151027
Out[2]:
You can see we have five columns; the 'path' column needs the most processing. Each URI has the form log/event?params
, where params can be one of project, instance, user, page, path, version, or release. We are going to extract these values into a separate column of type ARRAY
.
UDFs are functions written in one of the supported languages (currently SQL and Javascript), that take a column and produces a value, after performing some computation. The BigQuery UDF documentation explains that the CREATE TEMP FUNCTION
call is needed to define a UDF, including its parameter names and types, return type, and language. Datalab simplifies this syntax; it makes use of jsdoc-style // @param
comments to achieve the same result. Also, it exposes UDFs as a Python class, and a magic command, to make building queries simpler. Let's see how we can do this:
In [3]:
%%bq query
SELECT * FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5
Out[3]:
In [4]:
%%bq udf --name extract_params -l js
// A function to split a set of URL query parameters into an array
// @param path STRING
// @returns ARRAY<STRING>
var re = /[?&]([^=]*)=([^&]*)/g;
var result = [];
var match;
while ((match = re.exec(path)) != null) {
result.push(decodeURIComponent(match[2]));
}
return result;
Now we can try calling the UDF. We need to define a query to do this. We can call our UDF like any regular function call, taking one or more columns as input.
In [6]:
%%bq query -n extract_params_query --udfs extract_params
SELECT *, extract_params(path) as parameters FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5
In [7]:
%bq execute -q extract_params_query
Out[7]:
In order to see the actual expanded SQL, including the UDF defined above, we can inspect the query object, by typing its name:
In [8]:
extract_params_query
Out[8]:
You can learn how to test your UDF in the notebook by following the UDF Testing in the Notebook tutorial. If you have code that you regularly use in your UDFs, you can factor it out and put it in Google Cloud Storage, then import it. This technique is covered in the UDFs using Code in Cloud Storage tutorial.